Evaluating Emerging CXL-enabled Memory Pooling for HPC Systems
Current HPC systems provide memory resources that are statically configured
and tightly coupled with compute nodes. However, workloads on HPC systems are
evolving. Diverse workloads lead to a need for configurable memory resources to
achieve high performance and utilization. In this study, we evaluate a memory
subsystem design leveraging CXL-enabled memory pooling. Two promising use cases
of composable memory subsystems are studied -- fine-grained capacity
provisioning and scalable bandwidth provisioning. We developed an emulator to
explore the performance impact of various memory compositions. We also provide
a profiler to identify the memory usage patterns in applications and their
optimization opportunities. Seven scientific and six graph applications are
evaluated on various emulated memory configurations. Three out of seven
scientific applications had less than 10% performance impact when the pooled
memory backed 75% of their memory footprint. The results also show that a
dynamically configured high-bandwidth system can effectively support
bandwidth-intensive unstructured mesh-based applications like OpenFOAM.
Finally, we identify interference through shared memory pools as a practical
challenge for adoption on HPC systems. Comment: 10 pages, 13 figures. Accepted for publication in the Workshop on Memory
Centric High Performance Computing (MCHPC'22) at SC22.
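The capacity-provisioning result above (under 10% impact with 75% of the footprint on pooled memory) can be illustrated with a first-order latency model. This is a minimal sketch under assumed, illustrative latencies; the function name and the numbers are not from the paper:

```python
# Hypothetical first-order model: average memory access latency when a
# fraction of an application's footprint is backed by slower CXL-attached
# pooled memory. The 100 ns / 250 ns figures are illustrative assumptions.

def blended_latency(local_ns, pooled_ns, pooled_fraction):
    """Access-weighted average latency across local and pooled memory."""
    return (1.0 - pooled_fraction) * local_ns + pooled_fraction * pooled_ns

# Example composition: local DRAM ~100 ns, pooled memory ~250 ns,
# 75% of the footprint backed by the pool.
avg = blended_latency(100.0, 250.0, 0.75)  # 212.5 ns on average
```

Such a model ignores caching and access locality, which is precisely why the paper's emulator and profiler are needed to measure real application impact.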
tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads
Machine Learning applications on HPC systems have been gaining popularity in
recent years. The upcoming large scale systems will offer tremendous
parallelism for training through GPUs. However, I/O is another heavy component of Machine
Learning workloads, and it can become a performance bottleneck.
TensorFlow, one of the most popular Deep-Learning platforms, now offers a new
profiler interface and allows instrumentation of TensorFlow operations.
However, the current profiler only enables analysis at the TensorFlow platform
level and does not provide system-level information. In this paper, we extend
TensorFlow Profiler and introduce tf-Darshan, both a profiler and tracer, that
performs instrumentation through Darshan. We use the same Darshan shared
instrumentation library and implement a runtime attachment without using a
system preload. We can extract Darshan profiling data structures during
TensorFlow execution to enable analysis through the TensorFlow profiler. We
visualize the performance results through TensorBoard, the web-based TensorFlow
visualization tool. At the same time, we do not alter Darshan's existing
implementation. We illustrate tf-Darshan by performing two case studies on
ImageNet image and Malware classification. We show that by guiding optimization
using data from tf-Darshan, we increase POSIX I/O bandwidth by up to 19% by
selecting data for staging on fast tier storage. We also show that Darshan has
the potential of being used as a runtime library for profiling and providing
information for future optimization. Comment: Accepted for publication at the 2020 International Conference on
Cluster Computing (CLUSTER 2020).
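The runtime-attachment idea described above (loading the shared instrumentation library during execution rather than via a system preload) can be sketched with `ctypes`. The library path and the exported symbol name here are illustrative assumptions, not the actual Darshan API:

```python
import ctypes

# Hypothetical sketch of runtime attachment: open an instrumentation shared
# library while the process is running and read a counter through an exported
# symbol, instead of relying on LD_PRELOAD at process startup.
# "darshan_counter_snapshot" is an assumed symbol name for illustration.

def attach_and_read(lib_path):
    try:
        lib = ctypes.CDLL(lib_path)  # dlopen() the library at runtime
    except OSError:
        return None  # library not available on this system
    snapshot = getattr(lib, "darshan_counter_snapshot", None)
    if snapshot is None:
        return None  # symbol not exported by this library
    snapshot.restype = ctypes.c_long
    return snapshot()
```

In the real tool, the extracted Darshan records are translated into TensorFlow profiler events so they appear alongside TensorFlow operations in TensorBoard.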
Leveraging HPC Profiling & Tracing Tools to Understand the Performance of Particle-in-Cell Monte Carlo Simulations
Large-scale plasma simulations are critical for designing and developing
next-generation fusion energy devices and modeling industrial plasmas. BIT1 is
a massively parallel Particle-in-Cell code designed specifically for studying
plasma-material interaction in fusion devices. Its most salient characteristic
is the inclusion of collision Monte Carlo models for different plasma species.
In this work, we characterize the single-node, multi-node, and I/O performance
of the BIT1 code in two realistic cases using several HPC profilers, such as
perf, IPM, Extrae/Paraver, and Darshan. We find that the on-node performance
of the BIT1 sorting function is the main bottleneck. Strong scaling
tests show a parallel performance of 77% and 96% on 2,560 MPI ranks for the two
test cases. We demonstrate that communication, load imbalance, and
self-synchronization are important factors impacting the performance of
BIT1 in large-scale runs. Comment: Accepted by the Euro-Par 2023 workshops (TDLPP 2023); prepared in the
standardized Springer LNCS format, the paper consists of 12 pages, including
the main text, references, and figures.
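The strong-scaling figures quoted above follow from the usual parallel-efficiency definition. A small helper, with illustrative timings chosen only to show the shape of the calculation (not measured BIT1 data):

```python
# Strong-scaling efficiency relative to a baseline run: the ratio of actual
# speedup to ideal speedup. Timings and rank counts below are illustrative
# assumptions, not measurements from the paper.

def strong_scaling_efficiency(t_base, ranks_base, t_scaled, ranks_scaled):
    """(t_base * ranks_base) / (t_scaled * ranks_scaled), i.e. speedup / ideal."""
    return (t_base * ranks_base) / (t_scaled * ranks_scaled)

# Hypothetical example: 1000 s on 64 ranks vs 32.5 s on 2,560 ranks
eff = strong_scaling_efficiency(1000.0, 64, 32.5, 2560)  # ~0.77, i.e. 77%
```

An efficiency near 1.0 means runtime shrinks in proportion to the added ranks; values well below 1.0 point to the communication, load-imbalance, and self-synchronization costs the abstract identifies.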
sputniPIC: an Implicit Particle-in-Cell Code for Multi-GPU Systems
Large-scale simulations of plasmas are essential for advancing our
understanding of fusion devices, space, and astrophysical systems.
Particle-in-Cell (PIC) codes have demonstrated their success in simulating
numerous plasma phenomena on HPC systems. Today, flagship supercomputers
feature multiple GPUs per compute node to achieve unprecedented computing power
at high power efficiency. PIC codes require new algorithm design and
implementation for exploiting such accelerated platforms. In this work, we
design and optimize a three-dimensional implicit PIC code, called sputniPIC, to
run on a general multi-GPU compute node. We introduce a particle decomposition
data layout, in contrast to domain decomposition on CPU-based implementations,
to use particle batches for overlapping communication and computation on GPUs.
sputniPIC also natively supports different precision representations to
achieve speedup on hardware that supports reduced precision. We validate sputniPIC
through the well-known GEM challenge and provide performance analysis. We test
sputniPIC on three multi-GPU platforms and report a 200-800x performance
improvement over the CPU OpenMP version of sputniPIC. We
show that reduced precision could further improve performance by 45% to 80% on
the three platforms. Because of these performance improvements, sputniPIC
enables, on a single node with multiple GPUs, large-scale three-dimensional
PIC simulations that were previously possible only on clusters. Comment: Accepted for publication at the 32nd International Symposium on
Computer Architecture and High Performance Computing (SBAC-PAD 2020).
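The particle-batch layout described above can be sketched in plain Python. On a real GPU runtime the point of batching is that the transfer of batch i+1 can overlap with the computation on batch i; here the loop is serialized for clarity, and the toy "mover" is an assumption for illustration, not the sputniPIC mover:

```python
# Hypothetical sketch of particle batching: split the particle array into
# fixed-size batches so that, on a GPU, host-device transfers of one batch
# can overlap with computation on the previous one (serialized here).

def batches(particles, batch_size):
    """Yield consecutive slices of the particle list."""
    for i in range(0, len(particles), batch_size):
        yield particles[i:i + batch_size]

def push(batch, dt):
    """Placeholder particle mover: advance position by velocity * dt."""
    return [(x + v * dt, v) for (x, v) in batch]

# Toy particles as (position, velocity) pairs.
particles = [(0.0, 1.0), (1.0, 2.0), (2.0, 0.5), (3.0, 1.5)]
updated = [p for b in batches(particles, 2) for p in push(b, 0.1)]
```

In contrast to a domain decomposition, every batch here carries particles from anywhere in the domain, which is what lets the GPU pipeline stay busy regardless of where particles cluster spatially.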
FAST-ASKAP Synergy: Quantifying Coexistent Tidal and Ram-Pressure Strippings in the NGC 4636 Group
Combining new HI data from a synergetic survey of ASKAP WALLABY and FAST with
the ALFALFA data, we study the effect of ram-pressure and tidal interactions in
the NGC 4636 group. We develop two parameters to quantify and disentangle these
two effects on gas stripping in HI-bearing galaxies: the strength of external
forces at the optical-disk edge, and the outside-in extents of HI-disk
stripping. We find that gas stripping is widespread in this group, affecting
80% of HI-detected non-merging galaxies, and that 34% are experiencing both
types of stripping. Among the galaxies experiencing both effects, the strengths
(and extents) of ram-pressure and tidal stripping are independent of each
other. Both strengths are correlated with HI-disk shrinkage. The tidal strength
is related to a rather uniform reddening of low-mass galaxies
() when tidal stripping is the dominating effect. In
contrast, ram pressure is not clearly linked to the color-changing patterns of
galaxies in the group. Combining these two stripping extents, we estimate the
total stripping extent, and put forward an empirical model that can describe
the decrease of HI richness as galaxies fall toward the group center. The
stripping timescale we derived decreases with distance to the center, from
around to
near the center. Gas-depletion happens
since crossing for HI-rich galaxies,
but much quicker for HI-poor ones. Our results quantify in a physically
motivated way the details and processes of environmental-effects-driven galaxy
evolution, and might assist in analyzing hydrodynamic simulations in an
observational way. Comment: 44 pages, 22 figures, 5 tables, accepted for publication in ApJ.
Tables 4 and 5 are also available in machine-readable format.
Hypoplastic Left Heart Syndrome Current Considerations and Expectations
In the recent era, no congenital heart defect has undergone a more dramatic change in diagnostic approach, management, and outcomes than hypoplastic left heart syndrome (HLHS). During this time, survival to the age of 5 years (including Fontan) has ranged from 50% to 69%, but current expectations are that 70% of newborns born today with HLHS may reach adulthood. Although the 3-stage treatment approach to HLHS is now well founded, there is significant variation among centers. In this white paper, we present the current state of the art in our understanding and treatment of HLHS during the stages of care: 1) pre-Stage I: fetal and neonatal assessment and management; 2) Stage I: perioperative care, interstage monitoring, and management strategies; 3) Stage II: surgeries; 4) Stage III: Fontan surgery; and 5) long-term follow-up. Issues surrounding the genetics of HLHS, developmental outcomes, and quality of life are addressed in addition to the many other considerations for caring for this group of complex patients.